AITopics

Country: North America > United States (0.27)

Genre:

Research Report > New Finding (0.67)
Research Report > Experimental Study (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Neural Information Processing Systems, Jun-15-2026, 22:41:41 GMT

Alignment of Large Language Models with Constrained Learning

We study the problem of computing an optimal large language model (LLM) policy for the constrained alignment problem, where the goal is to maximize a primary reward objective while satisfying constraints on secondary utilities. Despite the popularity of Lagrangian-based LLM policy search in constrained alignment, iterative primal-dual methods often fail to converge, and non-iterative dual-based methods do not achieve optimality in the LLM parameter space. To address these challenges, we employ Lagrangian duality to develop an iterative dual-based alignment method that alternates between updating the LLM policy via Lagrangian maximization and updating the dual variable via dual descent. In theory, we characterize the primal-dual gap between the primal value in the distribution space and the dual value in the LLM parameter space. We further quantify the optimality gap of the learned LLM policies at near-optimal dual variables with respect to both the objective and the constraint functions. These results prove that dual-based alignment methods can find an optimal constrained LLM policy, up to an LLM parametrization gap. We demonstrate the effectiveness and merits of our approach through extensive experiments conducted on the PKU-SafeRLHF and Anthropic HH-RLHF datasets.
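The alternation described in this abstract (Lagrangian maximization for the policy, then a descent step on the dual variable) can be sketched on a toy problem. This is a minimal illustration, not the paper's method: it assumes a policy that is a plain distribution over a handful of candidate responses, a KL regularizer toward a reference policy with weight `beta` (an assumption not stated in the abstract), and a single constraint that the expected utility be at least `threshold`. All names are hypothetical.

```python
import numpy as np

def lagrangian_maximizer(pi_ref, reward, utility, lam, beta):
    """Closed-form maximizer of E[r] + lam * E[g] - beta * KL(pi || pi_ref)
    over distributions on a finite response set (assumed toy setting)."""
    logits = np.log(pi_ref) + (reward + lam * utility) / beta
    logits -= logits.max()          # stabilize the softmax
    weights = np.exp(logits)
    return weights / weights.sum()

def dual_descent(pi_ref, reward, utility, threshold,
                 beta=1.0, step=0.5, iters=500):
    """Alternate policy maximization and projected dual descent."""
    lam = 0.0
    for _ in range(iters):
        policy = lagrangian_maximizer(pi_ref, reward, utility, lam, beta)
        # Gradient of the dual function in lam is the constraint slack.
        slack = policy @ utility - threshold
        # Descent step, projected onto lam >= 0.
        lam = max(0.0, lam - step * slack)
    return lagrangian_maximizer(pi_ref, reward, utility, lam, beta), lam
```

When the constraint is violated the slack is negative, so the dual variable grows and the next policy puts more weight on high-utility responses.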

large language model, machine learning, natural language

Country: North America > United States (0.28)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.87)

Industry:

Banking & Finance (1.00)
Law (0.92)
Government (0.92)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Neural Information Processing Systems, Jun-14-2026, 19:03:27 GMT

Trained Mamba Emulates Online Gradient Descent in In-Context Linear Regression

State-space models (SSMs), particularly Mamba, have emerged as an efficient alternative to Transformers, with linear complexity for long-sequence modeling. Recent empirical work demonstrates that Mamba's in-context learning (ICL) capabilities are competitive with those of Transformers, a critical capacity for large foundation models. However, theoretical understanding of Mamba's ICL remains limited, restricting deeper insight into its underlying mechanisms. Even fundamental tasks such as linear regression ICL, widely studied as a standard theoretical benchmark for Transformers, have not been thoroughly analyzed in the context of Mamba. To address this gap, we study the training dynamics of Mamba on the linear regression ICL task. By developing novel techniques for the non-convex gradient descent optimization arising from Mamba's structure, we establish an exponential convergence rate to the ICL solution and derive a loss bound comparable to that of Transformers. Importantly, our results reveal that Mamba can perform a variant of online gradient descent to learn the latent function in context. This mechanism differs from that of Transformers, which are typically understood to achieve ICL by emulating gradient descent. The theoretical results are verified by experimental simulation.
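For readers unfamiliar with the reference algorithm, the following is a minimal sketch of plain online gradient descent on an in-context linear regression prompt. It is not the variant the paper attributes to Mamba, whose exact form is not given in this abstract. The function name, the zero initialization, and the fixed step size are assumptions made for illustration.

```python
import numpy as np

def online_gd_predict(xs, ys, x_query, step=0.05):
    """Process the in-context examples (x_t, y_t) one at a time,
    taking one squared-loss gradient step per example, then predict
    the label of x_query with the final weights."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    w = np.zeros(xs.shape[1])               # assumed zero initialization
    for x_t, y_t in zip(xs, ys):
        residual = w @ x_t - y_t            # prediction error on example t
        w = w - step * residual * x_t       # gradient of 0.5 * residual**2
    return float(w @ np.asarray(x_query, dtype=float))
```

Each example is seen once and in order, which is what distinguishes this from full-batch gradient descent over the whole prompt.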

artificial intelligence, exp, machine learning

Country: Asia (0.28)

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.90)

arXiv.org Machine Learning, Jun-11-2026

Efficient Multinomial Logistic Bandit via Frequent Directions

He, Linzhe, Zhang, Yu-Jie, Yang, Sifan, Zhang, Lijun

This paper studies efficient online algorithms for multinomial logistic bandits (MLogB), where the feedback distribution over $K+1$ outcomes follows a multinomial logistic model of $d$-dimensional action vectors. A representative UCB-type algorithm, OFUL-MLogB, achieves a regret bound of $\tilde{\mathcal{O}}(Kd\sqrt{T})$, but still requires $\mathcal{O}(K^3d^3)$ time and $\mathcal{O}(K^2d^2)$ space per round due to parameter estimation and optimistic reward construction, which is prohibitive in high-dimensional settings. To address this limitation, we propose EOFD-MLogB, which integrates frequent directions matrix sketching into OFUL-MLogB. By maintaining a low-rank SVD sketch of the accumulated Hessian, constrained online Newton updates in parameter estimation and $Kd \times K$ spectral-norm computations in the reward bonus are reduced to one-dimensional root-finding tasks and $K \times K$ eigenvalue computations, respectively. This yields dominant per-round time complexity $\mathcal{O}(Kd(m+K)^2)$ and space complexity $\mathcal{O}(Kd(m+K))$, where $m \ll d$ is the sketch size. We further prove a regret bound of $\tilde{\mathcal{O}}(\Delta_T(Kd\ln\Delta_T+m)\sqrt{T})$, where the sketching error factor $\Delta_T$ is controlled by the $m$-truncated spectral tail of the Hessian. Thus, when the Hessian is approximately low-rank, the regret is close to that of OFUL-MLogB. Experiments validate the computational efficiency and competitive performance.
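The frequent directions primitive that the abstract builds on can be sketched generically. The code below is the textbook row-streaming version with a buffer of `2 * m` rows, not EOFD-MLogB itself: how the paper applies the sketch to the accumulated Hessian is not described here, and the function name and buffer layout are assumptions.

```python
import numpy as np

def frequent_directions(rows, m):
    """Stream the rows of a matrix A into a (2m x d) buffer B such that
    A^T A - B^T B stays positive semidefinite and small in spectral norm."""
    rows = np.asarray(rows, dtype=float)
    d = rows.shape[1]
    sketch = np.zeros((2 * m, d))
    next_free = 0
    for row in rows:
        if next_free == 2 * m:
            # Buffer is full: shrink every squared singular value by the
            # (m+1)-th largest one, which zeroes out rows m and beyond.
            _, s, vt = np.linalg.svd(sketch, full_matrices=False)
            delta = s[m] ** 2 if len(s) > m else 0.0
            shrunk = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
            sketch = np.zeros((2 * m, d))
            sketch[:len(s)] = shrunk[:, None] * vt
            next_free = m
        sketch[next_free] = row
        next_free += 1
    return sketch
```

Because every shrink step only subtracts from the spectrum, `B.T @ B` never overestimates `A.T @ A`, and the error vanishes when the input has rank at most `m`.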

artificial intelligence, inequality, machine learning

2606.11968

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

arXiv.org Machine Learning, Jun-8-2026

Generalization in Deep Neural Networks: Minimax Rates for Gradient Methods

Zhou, Junyu, Wang, Puyu, Lei, Yunwen, Kloft, Marius, Ying, Yiming

A central mystery in deep learning is how neural networks, despite being highly non-convex and heavily overparameterized, are able to achieve near-zero training error while still generalizing well to unseen data. This paradox has sparked a surge of research aimed at understanding the convergence and generalization behavior of neural networks [1, 2, 6, 7, 15, 38, 41, 49]. The Neural Tangent Kernel (NTK), introduced by [20], has become a foundational tool for understanding the training dynamics of neural networks, especially those trained using gradient-based methods such as gradient descent (GD) and stochastic gradient descent (SGD). The core idea is to linearize the neural network around its random initialization, which enables the evolution of the network during training to be closely approximated by a kernel method associated with the corresponding NTK. This framework establishes a powerful connection between the evolution of a neural network during the training process and the behavior of kernel methods in a reproducing kernel Hilbert space (RKHS) induced by the NTK, allowing insights from kernel methods to inform our understanding of neural networks. Following this perspective, the influential work [34] showed that for regression problems, shallow neural networks trained by SGD can achieve generalization performance on par with their kernel counterparts.
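The linearization mentioned in this excerpt is usually written as follows. The notation is an assumption, since the excerpt introduces no symbols: $f(x;\theta)$ is the network output, $\theta_0$ the random initialization, and $K$ the resulting tangent kernel.

```latex
f(x;\theta) \;\approx\; f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^{\top} (\theta - \theta_0),
\qquad
K(x, x') \;=\; \big\langle \nabla_\theta f(x;\theta_0),\; \nabla_\theta f(x';\theta_0) \big\rangle .
```

Under this approximation the model is linear in $\theta$ with feature map $\nabla_\theta f(\cdot\,;\theta_0)$, which is why training can be compared with a kernel method in the RKHS induced by $K$.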

artificial intelligence, inequality, machine learning

2606.06772

Country: Europe (0.28)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.84)

arXiv.org Machine Learning, Apr-29-2026

When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient

Shang, Shuning, Strauss, Hubert, Wei, Stanley, Arora, Sanjeev, Razin, Noam

Training language models via reinforcement learning often relies on imperfect proxy rewards, since ground truth rewards that precisely define the intended behavior are rarely available. Standard metrics for assessing the quality of proxy rewards, such as ranking accuracy, treat incorrect rewards as strictly harmful. In this work, however, we highlight that not all deviations from the ground truth are equal. By theoretically analyzing which outputs attract probability during policy gradient optimization, we categorize reward errors according to their effect on the increase in ground truth reward. The analysis establishes that reward errors, though conventionally viewed as harmful, can also be benign or even beneficial by preventing the policy from stalling around outputs with mediocre ground truth reward. We then present two practical implications of our theory. First, for reinforcement learning from human feedback (RLHF), we develop reward model evaluation metrics that account for the harmfulness of reward errors. Compared to standard ranking accuracy, these metrics typically correlate better with the performance of a language model after RLHF, yet gaps remain in robustly evaluating reward models. Second, we provide insights for reward design in settings with verifiable rewards. A key theme underlying our results is that the effectiveness of a proxy reward function depends heavily on its interaction with the initial policy and learning algorithm.
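As a point of reference for the baseline metric named in this abstract, here is a minimal sketch of pairwise ranking accuracy between a proxy reward and a ground truth reward. The abstract does not define the metric, so the definition used here (fraction of output pairs with distinct ground truth rewards that the proxy orders the same way) is an assumption, as is the function name.

```python
from itertools import combinations

def ranking_accuracy(proxy, truth):
    """Fraction of output pairs, among those with different ground truth
    rewards, that the proxy reward ranks in the same order."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(truth)), 2):
        if truth[i] == truth[j]:
            continue                      # ties in the ground truth are skipped
        total += 1
        if (proxy[i] - proxy[j]) * (truth[i] - truth[j]) > 0:
            agree += 1
    return agree / total if total else float("nan")
```

Note that this score weighs every misordered pair equally, which is the property the abstract argues against when it says that not all reward errors are equally harmful.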

artificial intelligence, inequality, machine learning

2604.25872

Country: Asia (0.27)

Genre: Research Report > New Finding (0.65)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Neural Information Processing Systems, Apr-25-2026, 14:03:49 GMT

3e6260b81898beacda3d16db379ed329-Supplemental.pdf

artificial intelligence, inequality, probability

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)

Neural Information Processing Systems, Apr-24-2026, 14:15:01 GMT

Incrementality Bidding via Reinforcement Learning under Mixed and Delayed Rewards. Appendix A: Formal Definition of Inhomogeneous Poisson Process

The inhomogeneous Poisson (point) process is a Poisson point process whose Poisson parameter is a time-dependent function $r(\tau)$. Let $N(a,b)$ denote the number of points of an inhomogeneous Poisson process with intensity function $r(t)$ occurring in the interval $[a,b]$. Then the probability of $n$ points existing in the interval $[a,b]$ is given by $P(N(a,b) = n) = \frac{\Lambda(a,b)^n}{n!} e^{-\Lambda(a,b)}$, where $\Lambda(a,b) = \int_a^b r(t)\,dt$. In this paper, the points are the conversions, and the time-dependent intensity function $r(\cdot)$ is defined in Eq. (2); it depends on the realization of the conversions and on the parameter $\theta$. Suppose $X_1, \dots, X_n$ are independent, mean-zero, subexponential random variables, and $a = (a_1, \dots, a_n)$ is an $n$-dimensional constant vector. We first introduce the main idea of the PAMM algorithm.
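The count distribution of an inhomogeneous Poisson process, $P(N(a,b) = n) = \Lambda^n e^{-\Lambda} / n!$ with $\Lambda$ the integral of the intensity over $[a,b]$, can be evaluated numerically as in the sketch below. The intensity passed in is an arbitrary callable chosen for illustration, not the function from the paper's Eq. (2), and the trapezoidal integration and function names are assumptions.

```python
import math

def integrated_intensity(rate, a, b, steps=10_000):
    """Approximate the integral of rate(t) over [a, b] with the
    trapezoidal rule."""
    h = (b - a) / steps
    total = 0.5 * (rate(a) + rate(b))
    for k in range(1, steps):
        total += rate(a + k * h)
    return total * h

def count_probability(rate, a, b, n, steps=10_000):
    """Probability that exactly n points fall in [a, b]."""
    big_lambda = integrated_intensity(rate, a, b, steps)
    return big_lambda ** n * math.exp(-big_lambda) / math.factorial(n)
```

For example, `count_probability(lambda t: 2 * t, 0.0, 1.0, 0)` evaluates the chance of no points on the unit interval under a linearly increasing intensity.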

bft, machine learning, reinforcement learning

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.40)

Bok, Jinho, Li, Shuangping, Yu, Sophie H.

Detection of local geometry in random graphs: information-theoretic and computational limits

arXiv.org Machine Learning, Mar-26-2026

We study the problem of detecting local geometry in random graphs. We introduce a model $\mathcal{G}(n, p, d, k)$, where a hidden community of average size $k$ has edges drawn as a random geometric graph on $\mathbb{S}^{d-1}$, while all remaining edges follow the Erdős--Rényi model $\mathcal{G}(n, p)$. The random geometric graph is generated by thresholding inner products of latent vectors on $\mathbb{S}^{d-1}$, with each edge having marginal probability equal to $p$. This implies that $\mathcal{G}(n, p, d, k)$ and $\mathcal{G}(n, p)$ are indistinguishable at the level of the marginals, and the signal lies entirely in the edge dependencies induced by the local geometry. We investigate both the information-theoretic and computational limits of detection. On the information-theoretic side, our upper bounds follow from three tests based on signed triangle counts: a global test, a scan test, and a constrained scan test; our lower bounds follow from two complementary methods: truncated second moment via Wishart--GOE comparison, and tensorization of KL divergence. These results together settle the detection threshold at $d = \widetilde{\Theta}(k^2 \vee k^6/n^3)$ for fixed $p$, and extend the state-of-the-art bounds from the full model (i.e., $k = n$) for vanishing $p$. On the computational side, we identify a computational--statistical gap and provide evidence via the low-degree polynomial framework, as well as the suboptimality of signed cycle counts of length $\ell \geq 4$.
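The signed triangle count underlying the tests in this abstract can be sketched as follows. This computes only the global statistic, the sum over node triples of the product of the three centered edge indicators; the scan and constrained scan tests, and the thresholds the paper uses, are not reproduced. The function name is hypothetical.

```python
import numpy as np

def signed_triangle_count(adj, p):
    """Sum over unordered triples {i, j, k} of
    (A_ij - p) * (A_jk - p) * (A_ik - p) for a symmetric 0/1 adjacency
    matrix A with edge probability p."""
    centered = np.asarray(adj, dtype=float) - p
    np.fill_diagonal(centered, 0.0)       # no self-loops
    # trace(B^3) counts each unordered triple six times when B is
    # symmetric with a zero diagonal.
    return np.trace(centered @ centered @ centered) / 6.0
```

Under the Erdős--Rényi model the centered indicators are independent with mean zero, so the statistic has mean zero; dependence among edges shifts that mean, which is what a test based on this count would look for.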

artificial intelligence, inequality, machine learning